AFRL-AFOSR-JP-TR-2016-0073 Large-scale Linear Optimization through Machine Learning: From Theory to Practical System Design and Implementation
Abstract
Linear programming (LP) is one of the most popular and essential optimization tools, used for data analytics as well as in various scientific fields. However, the current state-of-the-art algorithms suffer from scalability issues when processing big data. For example, the commercial optimization software IBM CPLEX cannot handle an LP with more than hundreds of thousands of variables or constraints. Existing algorithms are fundamentally hard to scale because they are too complex to parallelize. To address the issue, we study the possibility of using the Belief Propagation (BP) algorithm as an LP solver. BP has shown remarkable performance on various machine learning tasks and naturally lends itself to fast parallel implementations. Despite this, very little work has been done in this area. In particular, while it is generally believed that BP implicitly solves an optimization problem, it is not well understood under what conditions the solution of BP converges to that of a corresponding LP formulation.

Our efforts consist of two main parts. First, we perform a theoretical study and establish conditions under which BP can solve LP [1,2]. Although there have been several works studying the relation between BP and LP for certain instances, our work provides a generic condition unifying all prior works for generic LP. Second, utilizing our theoretical results, we develop a practical BP-based parallel algorithm for solving generic LPs; it achieves a 71x speed-up while sacrificing only 0.1% accuracy compared to the state-of-the-art exact algorithm [3,4]. As a result of this study, the PIs have published two conference papers [1,3], and two follow-up journal papers [2,4] are under submission. We refer the readers to our published work [1,3] for details.

Introduction: The main goal of our research is to develop a distributed and parallel algorithm for large-scale linear optimization (or programming). Considering the popularity and importance of linear optimization in various fields, the proposed method has great potential for application to various big data analytics. Our approach is based on the Belief Propagation (BP) algorithm, which has shown remarkable performance on various machine learning tasks and naturally lends itself to fast parallel implementations. Our key contributions are summarized below:
1) We establish key theoretical foundations in the area of Belief Propagation. In particular, we show that BP converges to the solution of LP if certain sufficient conditions are satisfied. Our conditions not only cover various prior studies, including maximum weight matching, min-cost network flow and shortest path, but also lead to new applications such as vertex cover and traveling salesman.
2) While the theoretical study provides an understanding of the nature of BP, BP by itself suffers from slow convergence, oscillation and convergence to incorrect solutions. To make BP-based algorithms more practical, we design a BP-based framework that uses BP as a 'weight transformer' to resolve these convergence issues.
We refer the readers to our published work [1,3] for details. The rest of the report contains a summary of our work that appeared in UAI (Uncertainty in Artificial Intelligence) and the IEEE International Conference on Big Data [1,3], and of follow-up work [2,4] under submission to major journals.
Experiment: We first establish theoretical conditions under which Belief Propagation (BP) can solve Linear Programming (LP), and second provide a practical distributed/parallel BP-based framework for solving generic optimization problems. We demonstrate the wide applicability of our approach via popular combinatorial optimization problems including maximum weight matching, shortest path, traveling salesman, cycle packing and vertex cover.

Results and Discussion: Our contribution consists of two parts. Study 1 [1,2] looks at the theoretical conditions under which BP converges to the solution of LP. Our theoretical results unify almost all prior results about BP for combinatorial optimization. Furthermore, our conditions provide a guideline for designing distributed algorithms for combinatorial optimization problems. Study 2 [3,4] focuses on building a practical framework, based on the theory of Study 1, for boosting the empirical performance of BP. Our framework is generic and thus can easily be extended to various optimization problems. We also compare the empirical performance of our framework to other heuristics and state-of-the-art algorithms for several combinatorial optimization problems.

------------------------------------------------------- Study 1 -------------------------------------------------------

We first introduce the background for our contributions. A joint distribution of $n$ (binary) variables $x = [x_i] \in \{0,1\}^n$ is called a graphical model (GM) if it factorizes as follows: for $x = [x_i] \in \{0,1\}^n$,

$$\Pr[X = x] \propto \prod_{i \in \{1,\dots,n\}} \psi_i(x_i) \prod_{\alpha \in F} \psi_\alpha(x_\alpha),$$

where $\psi_i, \psi_\alpha$ are some non-negative functions, so-called factors; $F$ is a collection of subsets (each $\alpha \in F$ is a subset of $\{1,\dots,n\}$ with $|\alpha| \ge 2$); and $x_\alpha$ is the projection of $x$ onto the dimensions included in $\alpha$. An assignment $x^*$ is called a maximum-a-posteriori (MAP) assignment if $x^*$ maximizes the probability. The following figure depicts the graphical relation between factors $F$ and variables $x$.

Figure 1: Factor graph for the graphical model with factors $\alpha_1 = \{1,3\}$, $\alpha_2 = \{1,2,4\}$, $\alpha_3 = \{2,3,4\}$.

Now we introduce the (max-product) BP algorithm for approximating the MAP assignment in a graphical model. BP is an iterative procedure; at each iteration $t$, there are messages $\{m^t_{\alpha\to i}(c),\, m^t_{i\to\alpha}(c) : c \in \{0,1\}\}$ between each variable $x_i$ and every associated factor $\alpha \in F_i$, where $F_i := \{\alpha \in F : i \in \alpha\}$. The messages are updated as follows:

$$m^{t+1}_{\alpha\to i}(c) = \max_{x_\alpha : x_i = c} \psi_\alpha(x_\alpha) \prod_{j \in \alpha \setminus \{i\}} m^t_{j\to\alpha}(x_j), \qquad m^{t+1}_{i\to\alpha}(c) = \psi_i(c) \prod_{\alpha' \in F_i \setminus \{\alpha\}} m^t_{\alpha'\to i}(c).$$

Finally, given the messages, the BP marginal beliefs are computed as

$$b_i[c] = \psi_i(c) \prod_{\alpha \in F_i} m_{\alpha\to i}(c),$$

and BP outputs the approximated MAP assignment $x^{BP} = [x_i^{BP}]$ with $x_i^{BP} = 1$ if $b_i[1] > b_i[0]$, $x_i^{BP} = 0$ if $b_i[1] < b_i[0]$, and $x_i^{BP} = {?}$ (undecided) otherwise.

Now we are ready to introduce the main result of Study 1. Consider the following GM: for $x = [x_i] \in \{0,1\}^n$ and $w = [w_i] \in \mathbb{R}^n$,

$$\Pr[X = x] \propto \prod_{i} e^{-w_i x_i} \prod_{\alpha \in F} \psi_\alpha(x_\alpha),$$

where the factor function $\psi_\alpha$ for $\alpha \in F$ is defined as

$$\psi_\alpha(x_\alpha) = \begin{cases} 1 & \text{if } A_\alpha x_\alpha \ge b_\alpha,\; C_\alpha x_\alpha = d_\alpha \\ 0 & \text{otherwise} \end{cases}$$

for some matrices $A_\alpha, C_\alpha$ and vectors $b_\alpha, d_\alpha$. Consider the Linear Programming (LP) problem corresponding to the above GM:

$$\text{minimize } w \cdot x \quad \text{subject to } \psi_\alpha(x_\alpha) = 1\; \forall \alpha \in F, \quad x = [x_i] \in [0,1]^n.$$

One can easily observe that the MAP assignment for the GM corresponds to the (optimal) solution of the above LP if the LP has an integral solution $x^* \in \{0,1\}^n$. The following theorem is the main result of Study 1; it provides sufficient conditions under which BP can indeed find the LP solution. Roughly speaking, the conditions require that the LP has a unique integral solution and that each variable is associated with at most two factors, together with a local condition on each factor; see [1,2] for the precise statement.

Theorem 1 can be applied to several combinatorial optimization problems including matching, network flow, shortest path, vertex cover, etc. See [1,2] for the detailed proof of Theorem 1 and its applications to various combinatorial optimization problems including maximum weight matching, min-cost network flow, shortest path, vertex cover and traveling salesman.
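To make the message-update and belief rules above concrete, the following is a minimal, self-contained C++ sketch of max-product BP on a small binary GM whose factors are indicator constraints of the LP form above (here written in the equivalent maximization form, with $\psi_i(x_i) = e^{w_i x_i}$). Messages are normalized for numerical stability, a standard trick that does not change the argmax decision. The toy instance and all names are our own illustrative assumptions, not the authors' implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>
using namespace std;

// A factor psi_alpha given as an explicit 0/1 table over its scope.
struct Factor { vector<int> scope; vector<double> table; };

int main() {
    // Toy GM/LP: maximize w.x subject to x0+x1 <= 1 and x1+x2 <= 1, x in {0,1}^3.
    // (Tree-structured with unique optimum x* = (1,0,1), so BP is exact here.)
    int n = 3;
    vector<double> w = {1.0, 0.4, 0.8};
    vector<Factor> F = {
        {{0,1}, {1,1,1,0}},   // indicator of x0+x1 <= 1 (bit k of mask = scope[k])
        {{1,2}, {1,1,1,0}},   // indicator of x1+x2 <= 1
    };
    vector<vector<pair<int,int>>> Fi(n);   // (factor, position) pairs per variable
    for (int a = 0; a < (int)F.size(); ++a)
        for (int k = 0; k < (int)F[a].scope.size(); ++k)
            Fi[F[a].scope[k]].push_back({a, k});

    // m[a][k][c]: message factor a -> its k-th variable; q: variable -> factor.
    vector<vector<vector<double>>> m, q;
    for (auto& f : F) {
        m.push_back(vector<vector<double>>(f.scope.size(), {1, 1}));
        q.push_back(vector<vector<double>>(f.scope.size(), {1, 1}));
    }
    auto normalize = [](vector<double>& v) {
        double s = max(v[0], v[1]);
        if (s > 0) { v[0] /= s; v[1] /= s; }
    };

    for (int t = 0; t < 30; ++t) {
        // Factor -> variable: maximize psi_alpha times incoming q over x_alpha with x_i = c.
        for (int a = 0; a < (int)F.size(); ++a) {
            int sz = (int)F[a].scope.size();
            for (int k = 0; k < sz; ++k) {
                for (int c = 0; c < 2; ++c) {
                    double best = 0;
                    for (int mask = 0; mask < (1 << sz); ++mask) {
                        if (((mask >> k) & 1) != c) continue;
                        double val = F[a].table[mask];
                        for (int j = 0; j < sz; ++j)
                            if (j != k) val *= q[a][j][(mask >> j) & 1];
                        best = max(best, val);
                    }
                    m[a][k][c] = best;
                }
                normalize(m[a][k]);
            }
        }
        // Variable -> factor: psi_i(c) times the other incoming factor messages.
        for (int i = 0; i < n; ++i)
            for (auto [a, k] : Fi[i]) {
                for (int c = 0; c < 2; ++c) {
                    double val = exp(w[i] * c);   // psi_i(c) = e^{w_i c}
                    for (auto [a2, k2] : Fi[i])
                        if (a2 != a) val *= m[a2][k2][c];
                    q[a][k][c] = val;
                }
                normalize(q[a][k]);
            }
    }
    // Beliefs b_i[c] = psi_i(c) * prod of incoming messages; decide by argmax.
    for (int i = 0; i < n; ++i) {
        double b0 = 1, b1 = exp(w[i]);
        for (auto [a, k] : Fi[i]) { b0 *= m[a][k][0]; b1 *= m[a][k][1]; }
        printf("x%d = %d\n", i, b1 > b0 ? 1 : 0);
    }
    return 0;
}
```

On this tree-structured toy instance the sketch prints x0 = 1, x1 = 0, x2 = 1, the unique optimum; the brute-force enumeration inside each factor is only practical for small factor scopes.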
------------------------------------------------------- Study 2 -------------------------------------------------------

Study 2 mainly focuses on providing a distributed, generic BP-based combinatorial optimization solver with high accuracy and low computational complexity. In summary, the key contributions of Study 2 are as follows:
1) Practical BP-based algorithm design: To the best of our knowledge, this work is the first to propose a generic concept for designing BP-based algorithms that solve large-scale combinatorial optimization problems.
2) Parallel implementation: We also demonstrate that the algorithm is easily parallelizable. For the maximum weight matching problem, this translates to a 71x speed-up while sacrificing only 0.1% accuracy compared to the state-of-the-art exact algorithm.
3) Extensive empirical evaluation: We evaluate our algorithms on three different combinatorial optimization problems on diverse synthetic and real-world data-sets. Our evaluation shows that the framework achieves higher accuracy than other known heuristics.

Designing a BP-based algorithm for a given problem is easy in general. However, (a) it might diverge or converge very slowly; (b) even if it converges quickly, the BP decision might not be correct; and (c) even worse, BP might produce an infeasible solution, i.e., one that does not satisfy the constraints of the problem.

Figure 2: Overview of our generic BP-based framework.

To address these issues, we propose a generic BP-based framework that provides highly accurate approximate solutions for combinatorial optimization problems. The framework has two steps, as shown in Figure 2. In the first phase, it runs a BP algorithm for a fixed number of iterations without waiting for convergence. The second phase then runs a known heuristic using the BP beliefs instead of the original weights to output a feasible solution. Namely, the first and second phases are designed for 'BP weight transforming' and 'post-processing', respectively. Note that our evaluation mainly uses the maximum weight matching problem. The formal description of the maximum weight matching (MWM) problem is as follows: given a graph $G = (V,E)$ and edge weights $w = [w_e] \in \mathbb{R}^{|E|}$, find a set of edges such that each vertex is connected to at most one edge in the set and the sum of edge weights in the set is maximized. The problem is formulated as the following IP (Integer Programming) problem:

$$\text{maximize } \sum_{e \in E} w_e x_e \quad \text{subject to } \sum_{e \in \delta(v)} x_e \le 1\; \forall v \in V, \quad x = [x_e] \in \{0,1\}^{|E|},$$

where $\delta(v)$ is the set of edges incident to vertex $v \in V$. In the following paragraphs, we describe the two phases in more detail, in reverse order.

We first describe the post-processing phase. As mentioned, one of the main issues of a BP-based algorithm is that the decision on BP beliefs might give an infeasible solution. To resolve the issue, we post-process by applying existing heuristics for the given problem that find a feasible solution. Applying post-processing ensures that the solution is at least feasible. In addition, our key idea is to replace the original weights with the logarithm of the BP beliefs, i.e., the belief ratios computed from the messages above, and then apply known heuristics on these transformed weights to achieve higher accuracy. A sketch of the resulting two-phase algorithm for MWM follows.
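The following C++ sketch illustrates the two phases on MWM. Phase 1 runs a fixed number of iterations of the simplified log-ratio messages that max-product BP reduces to on the matching GM; phase 2 runs a greedy matching heuristic on the log-ratio beliefs instead of the raw weights. The message simplification, iteration budget and toy graph are our own illustrative assumptions, a sketch rather than the paper's implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>
using namespace std;

struct Edge { int u, v; double w; };

int main() {
    // Toy 4-cycle; the unique optimal matching is {0-1, 2-3}.
    int n = 4;
    vector<Edge> E = {{0,1,1.0}, {1,2,1.1}, {2,3,1.0}, {3,0,0.2}};
    int m = (int)E.size();
    vector<vector<int>> adj(n);               // ids of edges incident to each vertex
    for (int e = 0; e < m; ++e) { adj[E[e].u].push_back(e); adj[E[e].v].push_back(e); }

    // nu[e][s]: log-ratio message from endpoint s of edge e (s=0: u, s=1: v).
    vector<array<double,2>> nu(m, array<double,2>{0.0, 0.0});
    // mu(e, s): message from edge e toward endpoint s; it uses the message
    // the edge last received from its *other* endpoint.
    auto mu = [&](int e, int s) { return E[e].w + nu[e][s ^ 1]; };

    // Phase 1: a fixed number of synchronous BP iterations (no convergence test).
    for (int t = 0; t < 100; ++t) {
        vector<array<double,2>> nxt(m);
        for (int x = 0; x < n; ++x)
            for (int e : adj[x]) {
                int side = (E[e].u == x) ? 0 : 1;
                double best = 0;              // option: leave vertex x unmatched
                for (int e2 : adj[x])
                    if (e2 != e) best = max(best, mu(e2, E[e2].u == x ? 0 : 1));
                nxt[e][side] = -best;         // vertex-to-edge message
            }
        nu = nxt;
    }

    // Phase 2: greedy matching on the log-ratio beliefs, not the raw weights.
    vector<double> belief(m);
    for (int e = 0; e < m; ++e) belief[e] = E[e].w + nu[e][0] + nu[e][1];
    vector<int> order(m);
    for (int e = 0; e < m; ++e) order[e] = e;
    sort(order.begin(), order.end(), [&](int a, int b){ return belief[a] > belief[b]; });
    vector<bool> used(n, false);
    for (int e : order)
        if (belief[e] > 0 && !used[E[e].u] && !used[E[e].v]) {
            used[E[e].u] = used[E[e].v] = true;
            printf("match %d-%d (belief %.3f)\n", E[e].u, E[e].v, belief[e]);
        }
    return 0;
}
```

Whatever the state of the messages when the iteration budget runs out, the greedy phase always returns a feasible matching; the BP phase only re-ranks the edges.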
To confirm the effectiveness of the proposed post-processing mechanism, we compare it with the following two alternative post-processing schemes for the maximum weight matching problem, which remove edges to enforce a matching after BP processing in a naive manner:
Random: If there exists a vertex $v$ such that more than one neighboring edge is selected by the BP decision, randomly select one edge and remove the others.
Weight: If there exists a vertex $v$ such that more than one neighboring edge is selected by the BP decision, remove the edges of smaller weight.
Figure 3 compares the approximation ratio obtained using BP-belief-based post-processing versus the naive post-processing heuristics (random and weight). It shows that the proposed BP-belief-based post-processing outperforms the rest. Note that the results in Figure 3 were obtained by first applying BP message passing for weight transformation. Next, we explain how this is done in our framework.

Figure 3: (a) Average approximation ratio for different post-processing schemes. We use a local greedy algorithm as post-processing, based on either the original weights or the BP messages (i.e., beliefs); the 'Random selection' post-processing is also compared. (b) Effects of initial messages on the convergence of BP. We set $a_{uv} := \log \frac{m_{u\to(u,v)}(0)}{m_{u\to(u,v)}(1)} = c\,w_{uv}$, where the x-axis represents the value of $c$. (c) Approximation ratio for different initial messages $a_{uv} = 0,\, w_{uv}/2,\, w_{uv}$.

Now we describe the BP weight transforming phase. To improve the approximation quality and resolve the convergence issues, we apply three modifications to the standard BP algorithm: (1) careful initialization of messages, (2) noise addition and (3) hybrid damping of message updates.

Message Initialization. The standard message initialization is $m_{\alpha\to i} = m_{i\to\alpha} = 1$ for the maximum weight matching problem. However, the convergence rate of BP depends on the initial messages. As reported in Figure 4, we try different initializations by varying the log ratio $a_{uv} := \log \frac{m_{u\to(u,v)}(0)}{m_{u\to(u,v)}(1)} = c\,w_{uv}$ for $0 \le c \le 1$, where the case $c = 0.5$ shows the fastest convergence. The choice $a_{uv} = 0.5\,w_{uv}$ alleviates the fluctuation behavior of BP and boosts its convergence speed. We remind the reader that, under our framework, BP runs only for a fixed number of iterations for practical purposes, since it might converge too slowly even with the initialization $a_{uv} = 0.5\,w_{uv}$. With a fixed number of iterations, careful initialization becomes even more critical, as the experimental results in Figure 3(c) and Figure 4 suggest. For example, over 5000 iterations of BP, the standard initialization achieves at most a 30% approximation ratio, while the proposed method achieves 99%. Moreover, one can also observe that the advantage of additional BP updates diminishes as the number of iterations grows.

Figure 4: Effects of initial messages on the number of BP iterations. We set $a_{uv} = c\,w_{uv}$ for the value $c$ on the x-axis.

Noise Addition. The BP algorithm often oscillates when the MAP solution is not unique. To address this issue, we transform the original problem into one that has a unique solution with high probability by adding small noise to the weights; we apply this to all cases. Here, one has to be careful in deciding the range $r$ of the noise. If $r$ is too large, the quality of the BP solution deteriorates because the optimal solution might change from that of the original problem. On the other hand, if $r$ is too small compared to $w_e$, BP converges very slowly. To achieve a balance, we choose the range $r$ of the noise $r_e$ as 10% of the minimum distance among the weights. A sketch of both modifications appears below.
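The following short C++ sketch shows both modifications as we understand them: perturbing each weight with uniform noise whose range is 10% of the minimum gap between distinct weights, and initializing the log-ratio messages to $a_{uv} = w_{uv}/2$. The helper name `noiseRange` and the fallback for all-equal weights are our own illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <random>
#include <vector>
using namespace std;

// Range r of the noise: 10% of the minimum distance between distinct weights.
double noiseRange(vector<double> w) {
    sort(w.begin(), w.end());
    double gap = numeric_limits<double>::infinity();
    for (size_t i = 1; i < w.size(); ++i)
        if (w[i] > w[i-1]) gap = min(gap, w[i] - w[i-1]);
    return isfinite(gap) ? 0.1 * gap : 1e-6;   // fallback if all weights are equal
}

int main() {
    vector<double> w = {1.0, 1.0, 0.6, 0.3};   // illustrative edge weights (note the tie)
    // Noise addition: make the optimum unique with high probability.
    mt19937 rng(0);
    uniform_real_distribution<double> noise(0.0, noiseRange(w));
    for (double& we : w) we += noise(rng);
    // Careful initialization: a_uv = w_uv / 2 instead of the standard a_uv = 0.
    vector<double> a(w.size());
    for (size_t e = 0; e < w.size(); ++e) a[e] = 0.5 * w[e];
    // ... run the fixed-iteration BP of the earlier sketch from these messages ...
    return 0;
}
```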
We find that this results in over a 99.8% approximation ratio even when the solution is not unique, which differs little from the unique-solution case, as shown in Table I.

Table I: Approximation ratio of BP for MWM with multiple optima and with a unique optimum. We introduce a small noise to the edge weights and set the initial message by $a_{uv} = w_{uv}/2$.

Hybrid Damping. To further boost the convergence speed of the BP updates, we use a specific damping strategy to alleviate message oscillation: we update messages to be the average of the old and new messages, i.e.,

$$m^{t+1} \leftarrow \tfrac{1}{2}\left(m^{t} + \hat m^{t+1}\right),$$

where $\hat m^{t+1}$ denotes the newly computed message. We note that the damping strategy provides an effect similar to our proposed initialization $a_{uv} = w_{uv}/2$; hence, if one uses both, the effect of one might be degraded by the other. For this reason, we run the first half of the BP iterations without damping (to preserve the effect of the proposed initialization) and perform the last half of the BP iterations with damping. As reported in Table II, this hybrid approach outperforms the alternatives, including (a) no damping at all, (b) damping in every iteration, and (c) damping in the first half of the BP iterations and no damping in the last half.

Table II: Approximation ratio of BP without damping, BP with damping, BP with damping only for the first 50 iterations, and BP with damping only for the last 50 iterations. We introduce a small noise to the edge weights and set the initial message by $a_{uv} = w_{uv}/2$.

Now we describe the implementation, mostly the parallelization, of our framework. First, we introduce the asynchronous message update that enables efficient parallelization of BP message passing. Second, we illustrate the issues in parallelizing post-processing. Finally, we describe the parallel implementations of our algorithm and their benefits.

Asynchronous Message Update. For parallelization, we first divide the graph by partitioning the vertices and assign each partition to a single thread. However, if we naively parallelize the process using multiple threads, frequent synchronization may incur a large overhead. Thus, we apply an asynchronous message update, in which each vertex updates its message value right after the new value is calculated, eliminating the synchronization points between iterations. This makes the process faster because of the reduced synchronization. Figure 5 shows that the performance improvement (speed-up in running time) of the asynchronous update over the synchronous one is up to 237% in our example graph for the maximum weight matching problem with 16 threads. A sketch combining hybrid damping with this asynchronous schedule follows below.

Figure 5: Average running time of our BP-based algorithm with synchronous and asynchronous message updates.

Local Post-Processing. The second phase of our algorithm runs existing heuristics for post-processing to enforce the feasibility of the BP decisions. While the framework works with any heuristic post-processing method, for the entire process to be parallel, it is important that the post-processing step is also parallel. An important criterion for efficient parallelization is locality of computation: if the post-processing heuristics can compute the result locally without requiring global knowledge, they can be easily parallelized. Moreover, if they do not require synchronization, the running time can be further reduced.
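As a rough illustration of the two scheduling ideas above, the following OpenMP sketch damps only the last half of the iterations and updates messages in place, so a thread may read a neighbor's freshly written value within the same iteration (the benign data race that the asynchronous schedule deliberately allows). The flat one-message-per-vertex layout and the `computeMsg` callback are simplifying assumptions, not the paper's code, and for brevity the sketch keeps OpenMP's implicit per-iteration barrier, whereas the paper's version removes that synchronization point as well. Compile with -fopenmp.

```cpp
#include <omp.h>
#include <vector>
using namespace std;

// One BP run with hybrid damping and asynchronous (in-place) updates.
// computeMsg(v, msg) recomputes vertex v's outgoing message from its
// neighbors' current messages; its definition is problem-specific.
void runBP(int iters, int nVertices, vector<double>& msg,
           double (*computeMsg)(int v, const vector<double>& msg)) {
    for (int t = 0; t < iters; ++t) {
        bool damp = (t >= iters / 2);          // hybrid: damp only the last half
        // Each thread owns a static partition of the vertices and writes its
        // messages in place; no barrier-protected copy of the old messages is
        // kept, so neighbors may observe values from iteration t or t+1.
        #pragma omp parallel for schedule(static)
        for (int v = 0; v < nVertices; ++v) {
            double fresh = computeMsg(v, msg);
            msg[v] = damp ? 0.5 * (msg[v] + fresh)   // average of old and new
                          : fresh;                   // undamped update
        }
    }
}
```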
Parallel Implementation. The BP algorithm is easy to parallelize because of its message-passing nature. To demonstrate this, we parallelize our BP-based framework using three platforms: GraphChi, OpenMP and pthread.

Now we show the empirical performance of our framework on three popular combinatorial optimization problems: maximum weight matching (MWM), minimum weight vertex cover (MWVC) and maximum weight independent set (MWIS). We already introduced the IP formulation of MWM; those of MWVC and MWIS are as follows:

$$\text{(MWVC)}\quad \text{minimize } \sum_{v \in V} w_v x_v \quad \text{subject to } x_u + x_v \ge 1\; \forall (u,v) \in E, \quad x = [x_v] \in \{0,1\}^{|V|};$$

$$\text{(MWIS)}\quad \text{maximize } \sum_{v \in V} w_v x_v \quad \text{subject to } x_u + x_v \le 1\; \forall (u,v) \in E, \quad x = [x_v] \in \{0,1\}^{|V|}.$$

Experimental Setup. In our experiments, both real-world and synthetic datasets are used for evaluation. For MWM, we use data-sets from the University of Florida Sparse Matrix Collection. For larger-scale synthetic evaluation, we generate Erdős-Rényi random graphs (up to 50 million vertices with 2.5 billion edges) with an average vertex degree of 100 and edge weights drawn independently from the uniform distribution over the interval [0, 1]. For MWVC and MWIS, we use the frb series from BHOSLIB, which also contains the optimal solutions. We note that we perform no experiments on synthetic data-sets for MWVC and MWIS since these problems are NP-hard, i.e., it is infeasible to compute the optimal solutions. On the other hand, for MWM, Edmonds' Blossom algorithm can compute the optimal solution in polynomial time. All experiments in this section are conducted on a machine with an Intel Xeon(R) CPU E5-2690 @ 2.90GHz with 8 cores and 8 hyperthreads and 128GB of memory, unless otherwise noted.

Approximation Ratio. We now demonstrate that our BP-based approximation algorithm produces highly accurate results. In particular, we show that our BP-based algorithms outperform well-known heuristics for MWVC and MWIS, and closely approximate exact solutions for MWM, in all cases we evaluate. For MWM, we compare the approximation quality of the serial, synchronous BP implementation and the parallel, asynchronous implementation on both synthetic and real-world data-sets, where we compute the optimal solution using the Blossom algorithm to measure the approximation ratios. Table III summarizes our experimental results for MWM on the synthetic data-sets and the Florida data. Our BP-based algorithm achieves 99% to 99.9% approximation ratios.

Table III: MWM: Approximation ratio of our BP-based algorithm on synthetic and sparse matrix collection data-sets.

For MWVC, we use two post-processing procedures: a greedy algorithm and a 2-approximation algorithm. For the local greedy algorithm, we choose a random uncovered edge and add the one of its adjacent vertices with the smaller weight, repeating until all edges are covered. We compare the approximation quality of our BP-based algorithm to the cases where one uses only the greedy algorithm or only the 2-approximation algorithm. Figure 6 shows the experimental results for the two post-processing heuristics. The results show that our BP-based weight transformation enhances the approximation quality of known approximation heuristics by up to 43%.

Figure 6: MWVC: Average approximation ratio of our BP-based algorithm, the 2-approximation algorithm and the greedy algorithm on frb-series data-sets.

For MWIS, the experiments were performed on the frb-series data-sets. We use a greedy algorithm as the post-processing procedure, which selects vertices in order of decreasing weight until no further vertex can be selected without violating the independent set constraint. A sketch of this greedy phase on BP-transformed weights follows.
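As a concrete illustration, here is a minimal C++ sketch of the greedy MWIS post-processing just described, run on BP-transformed weights. The `bpWeight` values stand in for the log-belief weights produced by the first phase; the graph and values are illustrative assumptions, not data from the evaluation.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>
using namespace std;

int main() {
    // Toy 5-cycle; bpWeight stands in for the log-belief-transformed weights.
    int n = 5;
    vector<pair<int,int>> edges = {{0,1},{1,2},{2,3},{3,4},{4,0}};
    vector<double> bpWeight = {0.7, 0.3, 0.9, 0.4, 0.8};
    vector<vector<int>> adj(n);
    for (auto [u, v] : edges) { adj[u].push_back(v); adj[v].push_back(u); }

    vector<int> order(n);
    for (int v = 0; v < n; ++v) order[v] = v;
    sort(order.begin(), order.end(),
         [&](int a, int b){ return bpWeight[a] > bpWeight[b]; });

    vector<bool> picked(n, false);
    for (int v : order) {                    // highest transformed weight first
        bool independent = true;
        for (int u : adj[v])
            if (picked[u]) { independent = false; break; }
        if (independent) picked[v] = true;   // greedily keep the set independent
    }
    for (int v = 0; v < n; ++v)
        if (picked[v]) printf("select vertex %d\n", v);
    return 0;
}
```

Note that the feasibility check touches only a vertex's neighbors; this is exactly the locality property, discussed above, that makes the post-processing phase easy to parallelize.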
We compare the approximation quality of our BP-based algorithm and the standard greedy algorithm. Figure 7 shows that our BP-based framework enhances the approximation ratio of the solution by 2% to 23%.

Figure 7: MWIS: Average approximation ratio of our BP-based algorithm and the greedy algorithm on frb-series data-sets.

Parallelization Speed-up. Figure 8 compares the running time of the Blossom algorithm and our BP-based algorithm with a single core and with 16 cores. With five million vertices, our asynchronous parallel implementation is eight times faster than the synchronous serial implementation, while still retaining a 99.9% approximation ratio, as reported in Table III. To demonstrate the overall benefit in context, we compare its running time with that of the current fastest implementation of the Blossom algorithm, due to Kolmogorov. Here, we note that the Blossom algorithm is inherently hard to parallelize. For the parallel implementation, we report results for our pthread implementation, but the OpenMP implementation also shows comparable performance. For 20 million vertices (one billion edges), the running time of our algorithm is up to 71 times shorter than that of the Blossom algorithm, while sacrificing only 0.1% of accuracy. The running-time gap is expected to be even more significant for larger graphs, since the running times of our algorithm and the Blossom algorithm are linear and cubic in the number of vertices, respectively.

Figure 8: MWM: Running time of the Blossom algorithm and our BP-based algorithms.

Large-scale Optimization. Our algorithm can also handle large-scale instances because it is based on GMs, which inherently lend themselves to parallel and distributed implementations. To demonstrate this, we create large-scale instances containing up to 50 million vertices and 2.5 billion edges. We run our algorithm using GraphChi on a single consumer-level machine with an i7 CPU and 24GB of memory. Figure 9 shows the running time and memory usage of our algorithm for MWM and MWVC on the large data-sets.

Figure 9: MWM and MWVC: Running time and memory usage of the GraphChi-based implementation on large-scale graphs.

List of Publications and Significant Collaborations that resulted from your AOARD supported project: In standard format showing authors, title, journal, issue, pages, and date, for each category list the following:
a) papers published in peer-reviewed journals,
b) papers published in peer-reviewed conference proceedings,
[1] Sejun Park and Jinwoo Shin, Max-Product Belief Propagation for Linear Programming: Applications to Combinatorial Optimization, Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
[3] Inho Cho, Soya Park, Sejun Park, Dongsu Han and Jinwoo Shin, Practical Message-passing Framework for Large-scale Combinatorial Optimization, IEEE International Conference on Big Data (IEEE BigData), 2015.
c) papers published in non-peer-reviewed journals and conference proceedings,
d) conference presentations without papers,
e) manuscripts submitted but not yet published, and
[2] Sejun Park and Jinwoo Shin, Convergence and Correctness of Max-Product Belief Propagation for Linear Programming, under the second round of revision at SIAM Journal on Discrete Mathematics (SIDMA).
[4] Inho Cho, Soya Park, Sejun Park, Dongsu Han and Jinwoo Shin, Large-scale Combinatorial Optimization via Belief Propagation: Practical Perspective, submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS).
f) a list of any interactions with industry or with Air Force Research Laboratory scientists or significant collaborations that resulted from this work.
Attachments: Publications a), b) and c) listed above if possible.